    Design and implementation of in-network coherence

    Thesis (S.M.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2013. Title as it appears in MIT Commencement Exercises program, June 2013: Design and implementation of in-network coherence. Cataloged from PDF version of thesis. Includes bibliographical references (p. 101-104). CMOS technology scaling has enabled increasing transistor density on chip. At the same time, multi-core processors that provide increased performance, vis-à-vis power efficiency, have become prevalent in a power-constrained environment. The shared memory model is a predominant paradigm in such systems, easing programmability and increasing portability. However, with memory being shared by an increasing number of cores, a scalable coherence mechanism is imperative for these systems. Snoopy coherence has been a favored coherence scheme owing to its high performance and simplicity. However, there are few viable proposals to extend snoopy coherence to unordered interconnects - specifically, modular packet-switched interconnects that have emerged as a scalable solution to the communication challenges in the CMP era. This thesis proposes a distributed in-network global ordering scheme that enables snoopy coherence on unordered interconnects. The proposed scheme is realized on a two-dimensional mesh interconnection network, referred to as OMNI (Ordered Mesh Network Interconnect). OMNI is an enabling solution for the SCORPIO processor prototype developed at MIT - a 36-core chip multi-processor supporting snoopy coherence, and fabricated in a commercial 45nm technology. OMNI is shown to be effective, reducing runtime by 36% in comparison to directory and Hammer coherence protocol implementations. The OMNI network achieves an operating frequency of 833 MHz post-layout, occupies 10% of the chip area, and consumes less than 100 mW of power. by Suvinay Subramanian. S.M.
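
    The core requirement the thesis addresses is that snoopy coherence only works if every cache observes coherence requests in one agreed-upon total order, which ordered buses provide for free and unordered meshes do not. The toy Python model below illustrates just that requirement; the sequence counter, request format, and sorting step are illustrative stand-ins, not OMNI's distributed in-network mechanism.

        # Toy model of the ordering requirement behind snoopy coherence (not OMNI's
        # actual mechanism): every cache must apply coherence requests in one agreed
        # global order, even if the network delivers them to each core out of order.
        import random

        NUM_CORES = 4
        requests = [("GETX", core, addr) for core, addr in [(0, 0x40), (1, 0x80), (2, 0x40)]]

        # A global order is assigned to each request (in OMNI this happens in a
        # distributed fashion inside the network; here a simple counter stands in).
        ordered = list(enumerate(requests))          # [(seq, request), ...]

        for core in range(NUM_CORES):
            delivered = ordered[:]                   # each core may receive the requests...
            random.shuffle(delivered)                # ...in a different arrival order
            # Every core applies requests in global-sequence order regardless of
            # arrival order, so all caches observe the same total order.
            applied = [req for _, req in sorted(delivered)]
            assert applied == [req for _, req in ordered]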

    FLAT: An Optimized Dataflow for Mitigating Attention Performance Bottlenecks

    Attention mechanisms form the backbone of state-of-the-art machine learning models for a variety of tasks. Deploying them on deep neural network (DNN) accelerators, however, is prohibitively challenging, especially under long sequences, as this work identifies. This is because operators in attention layers exhibit limited reuse opportunities and quadratic growth in memory footprint, leading to severe memory-boundedness. To address this, we introduce a new attention-tailored dataflow, termed FLAT, which identifies fusion opportunities within the attention layer and implements an on-chip memory-aware interleaved execution and tiling mechanism. FLAT increases the effective memory bandwidth by efficiently utilizing the high-bandwidth, low-capacity on-chip buffer, and thus achieves better run time and compute resource utilization. In our evaluation, FLAT achieves 1.94x and 1.76x speedup and 49% and 42% energy reduction compared to baseline execution on state-of-the-art edge and cloud accelerators.
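
    The memory-footprint argument above is that materializing the full L x L attention score matrix is what makes long sequences memory-bound, and that fusing the softmax with the surrounding matrix multiplies tile by tile avoids it. The NumPy sketch below shows that general idea only; the block size, loop structure, and function name are illustrative and are not FLAT's actual dataflow or tiling policy.

        # Illustrative NumPy sketch of attention fusion/tiling: process queries in
        # blocks so the full (L x L) score matrix is never materialized at once.
        import numpy as np

        def blocked_attention(Q, K, V, block=64):
            L, d = Q.shape
            out = np.empty_like(Q)
            for start in range(0, L, block):
                q = Q[start:start + block]                    # (block, d) slice of queries
                scores = q @ K.T / np.sqrt(d)                 # (block, L): only one block of scores is live
                scores -= scores.max(axis=1, keepdims=True)   # numerically stable softmax
                probs = np.exp(scores)
                probs /= probs.sum(axis=1, keepdims=True)
                out[start:start + block] = probs @ V          # consume the block immediately
            return out

        Q = np.random.randn(512, 64); K = np.random.randn(512, 64); V = np.random.randn(512, 64)
        ref = blocked_attention(Q, K, V, block=512)           # one block == unblocked reference
        assert np.allclose(blocked_attention(Q, K, V, block=64), ref)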

    A scalable architecture for ordered parallelism

    We present Swarm, a novel architecture that exploits ordered irregular parallelism, which is abundant but hard to mine with current software and hardware techniques. In this architecture, programs consist of short tasks with programmer-specified timestamps. Swarm executes tasks speculatively and out of order, and efficiently speculates thousands of tasks ahead of the earliest active task to uncover ordered parallelism. Swarm builds on prior TLS and HTM schemes, and contributes several new techniques that allow it to scale to large core counts and speculation windows, including a new execution model, speculation-aware hardware task management, selective aborts, and scalable ordered commits. We evaluate Swarm on graph analytics, simulation, and database benchmarks. At 64 cores, Swarm achieves 51--122× speedups over a single-core system, and outperforms software-only parallel algorithms by 3--18×. National Science Foundation (U.S.) (Award CAREER-145299
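
    The programming model described above is simply tasks carrying programmer-specified timestamps, with execution semantics equivalent to running tasks in timestamp order while the hardware speculates ahead of that order. The sequential Python sketch below shows what the model looks like for a Dijkstra-style shortest-path computation, one of the ordered irregular workloads Swarm targets; the scheduler, task encoding, and graph are illustrative, not Swarm's hardware or task API.

        # Sequential sketch of timestamp-ordered task execution: tasks run in
        # timestamp order, and each task may spawn child tasks with later timestamps.
        import heapq

        graph = {0: [(1, 2), (2, 5)], 1: [(2, 1)], 2: []}     # node -> [(neighbor, weight)]
        dist = {n: float("inf") for n in graph}

        tasks = []                                            # priority queue ordered by timestamp
        heapq.heappush(tasks, (0, 0))                         # (timestamp = distance, node)

        while tasks:
            ts, node = heapq.heappop(tasks)                   # always run the earliest task next
            if ts >= dist[node]:
                continue                                      # an earlier task already settled this node
            dist[node] = ts
            for nbr, w in graph[node]:
                heapq.heappush(tasks, (ts + w, nbr))          # child tasks carry later timestamps

        assert dist == {0: 0, 1: 2, 2: 3}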

    SMART: A Single-Cycle Reconfigurable NoC for SoC Applications

    As technology scales, SoCs are increasing in core counts, leading to the need for scalable NoCs to interconnect the multiple cores on the chip. Given aggressive SoC design targets, NoCs have to deliver low latency and high bandwidth at low power and area overheads. In this paper, we propose the Single-cycle Multi-hop Asynchronous Repeated Traversal (SMART) NoC, a NoC that reconfigures and tailors a generic mesh topology for SoC applications at runtime. The heart of our SMART NoC is a novel low-swing clockless repeated link circuit embedded within the router crossbars that allows packets to potentially bypass all the way from source to destination core within a single clock cycle, without being latched at any intermediate router. Our clockless repeater link has been proven in silicon in 45nm SOI. Results show that at 2 GHz, we can traverse 8 mm within a single cycle, i.e., 8 hops with 1 mm cores. We implement the SMART NoC to layout and show that SMART NoC gives 60% latency savings and 2.2X power savings compared to a baseline mesh NoC. United States. Defense Advanced Research Projects Agency. The Ubiquitous High-Performance Computing Program
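
    A first-order way to see where the latency savings come from: a conventional mesh pays a router pipeline delay at every hop, whereas SMART can traverse up to HPC_max hops (8 in the 2 GHz, 1 mm-core configuration above) in one cycle when the bypass path is established. The sketch below is only that back-of-the-envelope model; the 2-cycle baseline router delay and the zero-contention assumption are assumptions of this sketch, not measurements from the paper.

        # First-order latency model of single-cycle multi-hop bypass (illustrative
        # only; ignores contention, turns, and serialization latency).
        import math

        ROUTER_CYCLES = 2      # assumed per-hop pipeline depth for the baseline mesh
        HPC_MAX = 8            # assumed max hops per cycle, e.g. 8 x 1 mm hops at 2 GHz

        def baseline_latency(hops):
            return hops * ROUTER_CYCLES

        def smart_latency(hops):
            return math.ceil(hops / HPC_MAX)

        for hops in (1, 4, 8, 12):
            print(hops, baseline_latency(hops), smart_latency(hops))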

    Training Recipe for N:M Structured Sparsity with Decaying Pruning Mask

    Sparsity has become one of the promising methods to compress and accelerate Deep Neural Networks (DNNs). Among different categories of sparsity, structured sparsity has gained more attention due to its efficient execution on modern accelerators. In particular, N:M sparsity is attractive because there are already hardware accelerator architectures that can leverage certain forms of N:M structured sparsity to yield higher compute efficiency. In this work, we focus on N:M sparsity and extensively study and evaluate various training recipes for N:M sparsity in terms of the trade-off between model accuracy and compute cost (FLOPs). Building upon this study, we propose two new decay-based pruning methods, namely "pruning mask decay" and "sparse structure decay". Our evaluations indicate that these proposed methods consistently deliver state-of-the-art (SOTA) model accuracy, comparable to unstructured sparsity, on a Transformer-based model for a translation task. The increase in the accuracy of the sparse model using the new training recipes comes at the cost of a marginal increase in the total training compute (FLOPs). Comment: 11 pages, 2 figures, and 9 tables. Published at the ICML Workshop on Sparsity in Neural Networks: Advancing Understanding and Practice, 2022. First two authors contributed equally.
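
    For readers unfamiliar with the pattern: N:M structured sparsity keeps only the N largest-magnitude weights within every group of M consecutive weights, which is the structure hardware sparse units can exploit. The NumPy sketch below shows the pattern itself; it does not implement the paper's pruning mask decay or sparse structure decay training recipes.

        # Minimal NumPy sketch of N:M structured sparsity: within every group of M
        # consecutive weights, only the N largest-magnitude entries are kept.
        import numpy as np

        def nm_prune(weights, n=2, m=4):
            w = weights.reshape(-1, m)                        # group every m consecutive weights
            keep = np.argsort(np.abs(w), axis=1)[:, -n:]      # indices of the n largest magnitudes per group
            mask = np.zeros_like(w)
            np.put_along_axis(mask, keep, 1.0, axis=1)
            return (w * mask).reshape(weights.shape)

        w = np.random.randn(4, 8)
        sparse_w = nm_prune(w, n=2, m=4)
        assert np.count_nonzero(sparse_w) == w.size // 2      # exactly 2 of every 4 survive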

    Application of the Problem Based Learning (PBL) Model to Improve Student Learning Outcomes on the Topic of Solubility and Solubility Product in Class XI IPA of SMA Negeri 1 Kampar

    This research aimed to improve student learning outcomes on the topic of solubility and solubility product in class XI science at SMAN 1 Kampar. It was an experimental study with a pretest-posttest design. The sample consisted of class XI science 2 as the experimental class and class XI science 3 as the control class. The experimental class was taught with the problem based learning model, while the control class used the discussion method. Data were analyzed using a t-test. The analysis showed t-count > t-table (1.6923 > 1.68), which means the problem based learning model can improve student learning outcomes on solubility and solubility product in class XI science at SMAN 1 Kampar, with the improvement falling into the high category.
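
    For illustration, the statistical step reported above is an independent two-sample t-test comparing the experimental and control classes against a critical value. The sketch below uses made-up placeholder scores, not the study's data; only the test itself is shown.

        # Independent two-sample (pooled-variance) t-test; placeholder scores only.
        import numpy as np

        def pooled_t(x, y):
            nx, ny = len(x), len(y)
            sp2 = ((nx - 1) * np.var(x, ddof=1) + (ny - 1) * np.var(y, ddof=1)) / (nx + ny - 2)
            return (np.mean(x) - np.mean(y)) / np.sqrt(sp2 * (1 / nx + 1 / ny))

        experimental = np.array([78, 85, 80, 90, 88, 84])     # placeholder posttest scores
        control      = np.array([72, 75, 70, 80, 78, 74])

        t_count = pooled_t(experimental, control)
        t_table = 1.68                                        # critical value used in the abstract
        print(t_count > t_table)                              # reject the null hypothesis if True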

    SCORPIO: A 36-Core Research Chip Demonstrating Snoopy Coherence on a Scalable Mesh NoC with In-Network Ordering

    In the many-core era, scalable coherence and on-chip interconnects are crucial for shared memory processors. While snoopy coherence is common in small multicore systems, directory-based coherence is the de facto choice for scalability to many cores, as snoopy relies on ordered interconnects, which do not scale. However, directory-based coherence does not scale beyond tens of cores due to excessive directory area overhead or inaccurate sharer tracking. Prior techniques supporting ordering on arbitrary unordered networks are impractical for full multicore chip designs. We present SCORPIO, an ordered mesh Network-on-Chip (NoC) architecture with a separate fixed-latency, bufferless network to achieve distributed global ordering. Message delivery is decoupled from the ordering, allowing messages to arrive in any order and at any time, and still be correctly ordered. The architecture is designed to plug-and-play with existing multicore IP and with practicality, timing, area, and power as top concerns. Full-system 36- and 64-core simulations on SPLASH-2 and PARSEC benchmarks show an average application runtime reduction of 24.1% and 12.9%, in comparison to distributed directory and AMD HyperTransport coherence protocols, respectively. The SCORPIO architecture is incorporated in an 11 mm-by-13 mm chip prototype, fabricated in IBM 45nm SOI technology, comprising 36 Freescale e200 Power Architecture cores with private L1 and L2 caches interfacing with the NoC via ARM AMBA, along with two Cadence on-chip DDR2 controllers. The chip prototype achieves a post-synthesis operating frequency of 1 GHz (833 MHz post-layout) with an estimated power of 28.8 W (768 mW per tile), while the network consumes only 10% of tile area and 19% of tile power. United States. Defense Advanced Research Projects Agency (DARPA UHPC grant at MIT (Angstrom)); Center for Future Architectures Research; Microelectronics Advanced Research Corporation (MARCO); Semiconductor Research Corporation
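
    The key decoupling claimed above is that the separate ordering network fixes each request's position in the global order, while the main network delivers request payloads whenever and in whatever order it can, with endpoints reordering on arrival. The toy Python sketch below shows that reorder-at-the-endpoint idea in isolation; the sequence numbers, buffering, and message format are illustrative, not SCORPIO's hardware design.

        # Toy sketch: payloads arrive in arbitrary order but carry a globally
        # assigned sequence number, and the endpoint drains them in order.
        import random

        messages = [(seq, f"req-{seq}") for seq in range(6)]   # (global order number, payload)
        random.shuffle(messages)                               # main network delivers in arbitrary order

        pending = {}
        next_seq = 0
        processed = []
        for seq, payload in messages:
            pending[seq] = payload                             # buffer the out-of-order arrival
            while next_seq in pending:                         # drain everything that is now in order
                processed.append(pending.pop(next_seq))
                next_seq += 1

        assert processed == [f"req-{s}" for s in range(6)]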

    TPU v4: An Optically Reconfigurable Supercomputer for Machine Learning with Hardware Support for Embeddings

    In response to innovations in machine learning (ML) models, production workloads changed radically and rapidly. TPU v4 is the fifth Google domain-specific architecture (DSA) and its third supercomputer for such ML models. Optical circuit switches (OCSes) dynamically reconfigure its interconnect topology to improve scale, availability, utilization, modularity, deployment, security, power, and performance; users can pick a twisted 3D torus topology if desired. Much cheaper, lower power, and faster than InfiniBand, OCSes and the underlying optical components are <5% of system cost and <3% of system power. Each TPU v4 includes SparseCores, dataflow processors that accelerate models that rely on embeddings by 5x-7x yet use only 5% of die area and power. Deployed since 2020, TPU v4 outperforms TPU v3 by 2.1x and improves performance/Watt by 2.7x. The TPU v4 supercomputer is 4x larger at 4096 chips and thus ~10x faster overall, which along with OCS flexibility helps large language models. For similarly sized systems, it is ~4.3x-4.5x faster than the Graphcore IPU Bow and is 1.2x-1.7x faster and uses 1.3x-1.9x less power than the Nvidia A100. TPU v4s inside the energy-optimized warehouse-scale computers of Google Cloud use ~3x less energy and produce ~20x less CO2e than contemporary DSAs in a typical on-premise data center. Comment: 15 pages; 16 figures; to be published at ISCA 2023 (the International Symposium on Computer Architecture).
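
    The operation SparseCores accelerate is the embedding lookup common in recommendation and language models: gathering rows of a very large table for a batch of sparse feature IDs and reducing them per example, a memory-bound pattern rather than a compute-bound one. The NumPy sketch below shows only that workload shape; the table size, IDs, and sum-pooling are made up and say nothing about TPU v4's implementation.

        # Minimal sketch of an embedding-bag style lookup: gather rows for sparse
        # feature IDs and pool them per example.
        import numpy as np

        vocab, dim = 10_000, 64
        table = np.random.randn(vocab, dim).astype(np.float32)   # large, memory-bound lookup table

        batch_ids = [[3, 17, 4211], [9], [512, 7]]                # variable-length sparse IDs per example
        pooled = np.stack([table[ids].sum(axis=0) for ids in batch_ids])

        print(pooled.shape)                                       # (3, 64): one pooled vector per example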

    JaxPruner: A concise library for sparsity research

    This paper introduces JaxPruner, an open-source JAX-based pruning and sparse training library for machine learning research. JaxPruner aims to accelerate research on sparse neural networks by providing concise implementations of popular pruning and sparse training algorithms with minimal memory and latency overhead. Algorithms implemented in JaxPruner use a common API and work seamlessly with the popular optimization library Optax, which, in turn, enables easy integration with existing JAX-based libraries. We demonstrate this ease of integration by providing examples in four different codebases: Scenic, t5x, Dopamine, and FedJAX, and provide baseline experiments on popular benchmarks. Comment: JaxPruner is hosted at http://github.com/google-research/jaxpruner
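
    As a point of reference for what such a library provides, the sketch below implements generic global magnitude pruning in plain NumPy. It is explicitly not JaxPruner's API; consult the repository linked above for the actual interface and the Optax integration.

        # Generic global magnitude pruning sketch (not JaxPruner's API): zero out
        # the smallest-magnitude fraction of all weights across a parameter dict.
        import numpy as np

        def magnitude_prune(params, sparsity=0.8):
            flat = np.concatenate([p.ravel() for p in params.values()])
            threshold = np.quantile(np.abs(flat), sparsity)       # global magnitude cutoff
            return {name: np.where(np.abs(p) >= threshold, p, 0.0) for name, p in params.items()}

        params = {"dense/kernel": np.random.randn(128, 64), "dense/bias": np.random.randn(64)}
        sparse = magnitude_prune(params, sparsity=0.8)
        total = sum(p.size for p in sparse.values())
        print(sum(np.count_nonzero(p) for p in sparse.values()) / total)   # roughly 0.2 of weights remain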